
We show that linear probing requires 5-independent hash functions for expected constant-time
performance, matching an upper bound of [Pagh et al. STOC'07]. More precisely, we construct a
4-independent hash function yielding expected logarithmic search time. For
$(1+\varepsilon)$-approximate minwise independence, we show that
$\Omega(\lg \frac{1}{\varepsilon})$-independent hash functions are required, matching an upper
bound of [Indyk, SODA'99]. We also show that the very fast 2-independent multiply-shift scheme of
Dietzfelbinger [STACS'96] fails badly in both applications.

1 Introduction 

The concept of $k$-wise independence was introduced by Wegman and Carter [23] in FOCS'79 and
has been the cornerstone of our understanding of hash functions ever since. Formally, a family
$\mathcal{H} = \{h : [u] \to [t]\}$ of hash functions is $k$-independent if (1) for any distinct
keys $x_1, \dots, x_k \in [u]$, the hash codes $h(x_1), \dots, h(x_k)$ are independent random
variables; and (2) for any fixed $x$, $h(x)$ is uniformly distributed in $[t]$.

As the concept of independence is fundamental to probabilistic analysis, $k$-independent
functions are both natural and powerful in algorithm analysis. They allow us to replace the heuristic
assumption of truly random hash functions with real (implementable) hash functions that are still
"independent enough" to yield provable performance guarantees. We are then left with the natural
goal of understanding the independence required by algorithms.

Once we have proved that $k$-independence suffices for a hashing-based randomized algorithm,
we are free to use any $k$-independent hash function. The canonical construction of a
$k$-independent family is based on polynomials of degree $k - 1$. Let $p \ge u$ be prime. Picking
random $a_0, \dots, a_{k-1} \in \{0, \dots, p - 1\}$, the hash function is defined by:

$$h(x) = \big( (a_{k-1} x^{k-1} + \cdots + a_1 x + a_0) \bmod p \big) \bmod t$$

For $p \gg t$, the hash function is statistically close to $k$-independent.
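The polynomial construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `make_poly_hash` and the `rng` parameter are hypothetical, the caller is assumed to supply a prime `p` at least as large as the universe, and the polynomial is evaluated with Horner's rule.

```python
import random

def make_poly_hash(k, p, t, rng=random):
    """Draw h(x) = ((a_{k-1} x^{k-1} + ... + a_1 x + a_0) mod p) mod t.

    k   -- number of random coefficients (degree k - 1 polynomial)
    p   -- a prime, assumed to be at least the universe size
    t   -- size of the range [t]
    rng -- any object with a randrange method (hypothetical parameter)
    """
    coeffs = [rng.randrange(p) for _ in range(k)]  # a_0, ..., a_{k-1}

    def h(x):
        acc = 0
        for a in reversed(coeffs):  # Horner's rule, highest coefficient first
            acc = (acc * x + a) % p
        return acc % t

    return h

# Example: a 5-independent family member hashing into 1024 cells,
# using the Mersenne prime 2^61 - 1 as p.
h = make_poly_hash(5, 2**61 - 1, 1024)
```

Evaluating the polynomial costs $k - 1$ multiplications modulo $p$ per key, which is one reason low independence is attractive in practice.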

Sometimes 2-independence suffices. For instance, if one implements a hash table by chaining, the
time it takes to query $x$ is proportional to the number of keys $y$ colliding with $x$ (i.e. $h(x) = h(y)$).
Thus, pairwise independence of $h(x)$ and $h(y)$ is all we need for expected constant query time. We
note that 2-independence also suffices for the 2-level hashing of Fredman et al. [8], yielding static
hash tables with constant query time.

Table 1: Expected time bounds for linear probing with a bad family of $k$-independent hash functions.
Construction time refers to the total time to insert $n$ keys starting from an empty table.

At the other end of the spectrum, $O(\lg n)$-independence suffices in a vast majority of
applications. One reason for this is the Chernoff bounds of [18] for $k$-independent events, whose
probability bounds differ from the full-independence Chernoff bound by $2^{-\Omega(k)}$. Another reason
is that random graphs with $O(\lg n)$-independent edges p] share many of the properties of truly
random graphs.

In this paper, we study two compelling applications in which independence bigger than 2 is
currently needed: linear probing and minwise-independent hashing. (The reader unfamiliar with
these applications will find more details below.) For linear probing, Pagh et al. [13] showed that
5-independence suffices, thus giving the first realistic implementation of linear probing with formal
guarantees. For minwise-independence, Indyk [10] showed that $\varepsilon$-approximation can be
obtained using $O(\lg \frac{1}{\varepsilon})$-independence.

In both cases, it was known that 2-independence does not suffice P21E], and, indeed, the simplest
family $x \mapsto (ax + b) \bmod p$ provides a counterexample. However, a significant gap remained
to the upper bounds.

In this paper, we close this gap, showing that both upper bounds are, in fact, tight. We do this by
exhibiting carefully constructed families for which these algorithms fail: for linear probing, we give
a 4-independent family that leads to $\Omega(\lg n)$ expected query time; and for minwise independence,
we give an $\Omega(\lg \frac{1}{\varepsilon})$-independent family that leads to $2\varepsilon$
approximation. In fact, we will present a complete understanding of linear probing with low
independence as summarized in Table 1.

Concrete schemes. Our results give a powerful understanding of a natural combinatorial
resource (independence) for two important algorithmic questions. In other words, they are limits on
how far the paradigm of independence can bring us. Note, however, that independence is only one
property that concrete hash schemes have. In a particular application, a hash scheme can behave
much better than its independence guarantees, if it has some other probabilistic property unrelated
to independence. Obviously, proving that a concrete hashing scheme works is not as attractive as
proving that every $k$-independent scheme works, including more efficient $k$-independent schemes
found in the future. However, if low independence does not work, then a concrete scheme may be
the best we can hope for.

The most practical 2-independent hash function is not the standard
$x \mapsto ((ax + b) \bmod p) \bmod t$, but Dietzfelbinger's multiply-shift scheme [6], which on some
computers is 10 times as fast [20]. To hash $w$-bit integers to $\ell$-bit integers, $\ell < w$, the
scheme picks two random $2w$-bit integers $a$ and $b$, and computes $(ax + b) \gg (2w - \ell)$, where
$\gg$ denotes unsigned shift.
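A minimal Python sketch of the multiply-shift scheme as described above. The function name `make_multiply_shift` and the `rng` parameter are hypothetical. Since Python integers are unbounded, the sketch assumes that the product is taken modulo $2^{2w}$, as fixed-width unsigned arithmetic would do, and applies an explicit mask for that purpose.

```python
import random

def make_multiply_shift(w, l, rng=random):
    """Hash w-bit integers to l-bit integers via (a*x + b) >> (2w - l).

    a and b are random 2w-bit integers. The mask emulates 2w-bit unsigned
    overflow, which is an assumption made here because Python integers
    never wrap around on their own.
    """
    a = rng.getrandbits(2 * w)
    b = rng.getrandbits(2 * w)
    mask = (1 << (2 * w)) - 1

    def h(x):
        return ((a * x + b) & mask) >> (2 * w - l)

    return h

# Example: hash 32-bit keys to 10-bit values (a table with 1024 cells).
h = make_multiply_shift(32, 10)
```

In a language with native 64-bit words and $w = 32$, the whole function is one multiplication, one addition, and one shift, which is where the speed advantage comes from.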

In this paper, we prove that linear probing with multiply-shift hashing suffers from $\Omega(\lg n)$
expected running times on some input. Similarly, we show that minwise independent hashing may
have a very large approximation error of $\varepsilon = \Omega(\lg n)$. While these results are not
surprising, given the "moral similarity" of multiply-shift and $ax + b \bmod p$ schemes, they do
require rather involved arguments. We feel this effort is justified, as it brings the theoretical
lower bounds in line with programming reality.

Later work. In our continued search for efficient hashing schemes with good theoretical
properties, we later considered simple tabulation hashing [16], which breaks fundamentally from
polynomial hashing schemes. Tabulation based hashing is comparable in speed to multiply-shift
hashing [6], but it uses much more space (polynomial instead of constant). It is only 3-independent,
yet it does give constant expected time for linear probing and $o(1)$-approximate minwise hashing.
It is the negative findings of the current paper that motivated us to look into the much more
space-consuming tabulation hashing.

The problems discovered here for minwise hashing with low independence also led the second
author to consider alternatives more suitable for low independence [21].

1.1 Technical Discussion: Linear Probing 

Linear probing uses a hash function to map a set of keys into an array of size $t$. When inserting $x$,
if the desired location $h(x)$ is already occupied, the algorithm scans $h(x) + 1, h(x) + 2, \dots$ until
an empty location is found, and places $x$ there. The query algorithm starts at $h(x)$ and scans either
until it finds $x$, or runs into an empty position, which certifies that $x$ is not in the hash table. We
assume constant load of the hash table, e.g. the number of keys is $n \le \frac{2}{3} t$.
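The insertion and query procedures just described can be sketched as follows. The class name `LinearProbingTable` and the probe counter returned by `contains` are illustrative choices, not part of the original text. The sketch assumes that scans wrap around modulo $t$ and refuses to fill the last empty cell so that every scan terminates.

```python
class LinearProbingTable:
    """Linear probing over an array of size t with a caller-supplied hash h."""

    def __init__(self, t, h):
        self.t = t
        self.h = h
        self.slots = [None] * t
        self.size = 0

    def insert(self, x):
        # Keep at least one empty cell so that scans always terminate.
        if self.size >= self.t - 1:
            raise OverflowError("table is full")
        i = self.h(x) % self.t
        while self.slots[i] is not None:
            if self.slots[i] == x:
                return  # already stored
            i = (i + 1) % self.t  # scan h(x) + 1, h(x) + 2, ...
        self.slots[i] = x
        self.size += 1

    def contains(self, x):
        """Return (found, number_of_cells_inspected)."""
        i = self.h(x) % self.t
        probes = 1
        while self.slots[i] is not None:
            if self.slots[i] == x:
                return True, probes
            i = (i + 1) % self.t
            probes += 1
        return False, probes  # hit an empty cell: x is not stored

# Example with a deliberately bad hash function that sends every key to cell 0.
table = LinearProbingTable(8, lambda x: 0)
for key in (10, 20, 30):
    table.insert(key)
```

With the constant hash function in the example, the three keys form a single run, so an unsuccessful search has to walk the whole run before reaching an empty cell.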

This classic data structure is one of the most popular implementations of hash tables, due to its
unmatched simplicity and efficiency. The practical use of linear probing dates back at least to 1954
to an assembly program by Samuel, Amdahl, Boehme (cf. [12]). On modern architectures, access
to memory is done in cache lines (of much more than a word), so inspecting a few consecutive
values typically translates into just one memory probe. Even if the scan straddles a cache line, the
behavior will still be better than a second random memory access on architectures with prefetching.
Empirical evaluations [2] [9] Q3] confirm the practical advantage of linear probing over other known
schemes, while cautioning [9] [22] that it behaves quite unreliably with weak hash functions (such
as 2-independent). Taken together, these findings form a strong motivation for theoretical analysis.

Linear probing was first shown to take expected constant time per operation in 1963 by
Knuth [11], in a report now considered the birth of algorithm analysis. However, this required
truly random hash functions.

A central open question of Wegman and Carter [23] was how linear probing behaves with
$k$-independence. Siegel and Schmidt [17] [19] showed that $O(\lg n)$-independence suffices. Recently,
Pagh et al. [13] showed that even 5-independent hashing works. We now close this line of work,
showing that 4-independence is not enough.

Review of the 5-independence upper bound. To better situate our lower bounds, we begin
by reviewing the upper bound of [13]. Our proof here is much simpler than the one in [13] but
assumes a load factor below 2/3. A more elaborate proof considering all load factors is presented
in [16].

The main probabilistic tool featuring in this analysis is a 4th moment bound. Consider throwing
$n$ balls into $t$ bins uniformly. Let $X_i$ be the indicator of the event that ball $i$ lands in the
first bin, and $X = \sum_{i=1}^{n} X_i$ the number of balls in the first bin. We have
$\mu = E[X] = \frac{n}{t}$. Then, the $k$-th moment of $X$ is defined as $E[(X - \mu)^k]$.






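As long as our placement of the balls is $k$-independent, the $k$-th moment is identical to the case
of full independence. For instance, the 4th moment is:

$$E[(X - \mu)^4] = E\Big[\Big(\sum_{i} (X_i - \tfrac{1}{t})\Big)^4\Big] = \sum_{i,j,k,l \in [n]} E\big[(X_i - \tfrac{1}{t})(X_j - \tfrac{1}{t})(X_k - \tfrac{1}{t})(X_l - \tfrac{1}{t})\big].$$

The only question in calculating this quantity is the independence of sets of at most 4 items. Thus,
4-independence preserves the 4th moment of full randomness. Since $E[X_i - \frac{1}{t}] = 0$, every
term in which some index appears exactly once vanishes, and

$$E[(X - \mu)^4] = \sum_{i} E\big[(X_i - \tfrac{1}{t})^4\big] + \binom{4}{2} \sum_{i < j} E\big[(X_i - \tfrac{1}{t})^2 (X_j - \tfrac{1}{t})^2\big].$$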

Moments are a standard approach for bounding the probability of large deviations. Let's say that
we expect $\mu$ items in the bin, but have capacity $2\mu$; what is the probability of overflow? A direct
calculation shows that the 4th moment is $E[(X - \mu)^4] = O(\mu^2)$. Then, by a Markov bound, the
probability of overflow is $\Pr[X \ge 2\mu] \le \Pr[(X - \mu)^4 \ge \mu^4] = O(1/\mu^2)$. By contrast, if we
only have 2-independence, we can use the 2nd moment $E[(X - \mu)^2] = O(\mu)$ and obtain
$\Pr[X \ge 2\mu] = O(1/\mu)$. Observe that the 3rd moment is not useful for this approach, since
$(X - \mu)^3$ can be negative, so Markov does not apply.
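To make the comparison between the 2nd and 4th moment bounds concrete, the following sketch computes the exact central moments of the bin load under full independence (a binomial distribution) and the resulting Markov bounds on overflow. The function names are hypothetical, and exact rational arithmetic is used so that no rounding is involved.

```python
from fractions import Fraction
from math import comb

def central_moment(n, t, k):
    """Exact k-th central moment E[(X - mu)^k] for X ~ Binomial(n, 1/t)."""
    q = Fraction(1, t)
    mu = n * q
    return sum(
        comb(n, j) * q**j * (1 - q) ** (n - j) * (j - mu) ** k
        for j in range(n + 1)
    )

def overflow_bound(n, t, k):
    """Markov bound E[(X - mu)^k] / mu^k on Pr[X >= 2 mu], for even k."""
    mu = Fraction(n, t)
    return central_moment(n, t, k) / mu**k

# Example: n = 12 balls into t = 3 bins, so mu = 4.
second = overflow_bound(12, 3, 2)  # bound available with 2-independence
fourth = overflow_bound(12, 3, 4)  # bound available with 4-independence
```

For these small parameters the 4th moment bound is already about half the 2nd moment bound, and the gap widens as $\mu$ grows, in line with the $O(1/\mu^2)$ versus $O(1/\mu)$ rates.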

To apply moments to linear probing, we consider a perfect binary tree spanning the array $[t]$
where $t$ is a power of two. For notational convenience, we assume that the load factor is 1/3, that
is, $n = t/3$. A node at height $h \le \log_2 t$ has an interval of $2^h$ array positions below it, and is
identified with this interval. We expect at most $2^h/3$ keys to be hashed to the interval (but more
or fewer keys may end in the interval, since items are not always placed at their hash position). Call
the node "near-full" if at least $\frac{2}{3} 2^h$ keys hash to its interval.

We will now bound the total time it takes to construct the hash table (the cost of inserting $n$
distinct items). A run is a maximal interval of filled locations. If the table consists of runs of
$k_1, k_2, \dots$ keys ($\sum_i k_i = n$), the cost of constructing it is bounded from above by
$O(k_1^2 + k_2^2 + \cdots)$. To bound these runs, we make the following crucial observation: if a run
contains between $2^h$ and $2^{h+1}$ keys, then some node at height $h - 2$ above it is near-full. In
fact, there will be such a near-full height $h - 2$ node whose last position is in the run.

For a proof, we study a run of length at least $2^h$. The run is preceded by an empty position,
so all keys in the run are hashed to the run (but may appear later in the run than the position
they hashed to). There are at least 4 consecutive height $h - 2$ nodes with their last position in
the interval. Assume for a contradiction that none of these are near-full. The first node (whose
first positions may not be in the run) contributes less than $\frac{2}{3} 2^{h-2}$ keys to the run (in the
most extreme case, this many keys hash to the last position of that node). The subsequent nodes have
all $2^{h-2}$ positions in the run, but with less than $\frac{2}{3} 2^{h-2}$ keys hashing to these positions.
Even with the maximal excess from the first node, we cannot fill the intervals of two subsequent
nodes, so the run must stop before the end of the third node, contradicting that its last position was
in the run.

Each node has its last position in at most one run, so the observation gives an upper bound on
the cost: add $O(2^{2h})$ for each near-full node at some height $h$. Denoting by $p(h)$ the probability
that a node on height $h$ is near-full, the expected total cost over all heights is thus bounded by

$$\sum_{h=0}^{\log_2 t} (t/2^h) \cdot p(h) \cdot 2^{2h} = O\Big(n \cdot \sum_{h=0}^{\log_2 t} 2^h \cdot p(h)\Big).$$

Using the 2nd moment to bound $p(h)$, we obtain $p(h) = O(2^{-h})$, so the total expected cost with
2-independence is $O(n \lg n)$. However, the 4th moment gives $p(h) = O(2^{-2h})$, so the total expected
cost with 4-independence is $O(n)$.
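The effect of the two bounds on $p(h)$ can be seen by evaluating the sum directly. The sketch below plugs in $p(h) = 2^{-h}$ and $p(h) = 2^{-2h}$ with all hidden constants set to 1, which is an assumption made purely for illustration; the function name is hypothetical.

```python
def expected_cost_bound(t, p):
    """Evaluate sum_{h=0}^{log2 t} (t / 2^h) * p(h) * 2^(2h).

    t -- table size, assumed to be a power of two
    p -- function giving the near-full probability bound at height h
    """
    height = t.bit_length() - 1  # equals log2(t) when t is a power of two
    return sum((t / 2**h) * p(h) * 2 ** (2 * h) for h in range(height + 1))

t = 1024
with_2nd_moment = expected_cost_bound(t, lambda h: 2.0**-h)        # t * (log2 t + 1)
with_4th_moment = expected_cost_bound(t, lambda h: 2.0 ** (-2 * h))  # less than 2 * t
```

With the 2nd moment bound every height contributes the same amount $t$, giving the logarithmic factor, whereas with the 4th moment bound the contributions decay geometrically and the sum stays linear.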
To bound the running time of one particular operation (query or insert $q$), we want $q$ to be
independent of the construction. Thus, if the hash function is $k$-independent, then we view the hash
of $q$ as independent of a $(k - 1)$-independent construction. The expected time is then the average
distance from a position to the end of the run containing it. There is also a constant to be added
for empty positions, but we ignore that below. The average cost is then a fraction $1/t$ ($< 1/n$)
of the sum of the squared run lengths, which is exactly what we bounded above. Thus, with
3-independent hashing, the expected operation time is $O(\log n)$ while with 5-independent hashing,
the expected operation time is $O(1)$. For load factors below 2/3, this establishes the upper bounds
from Table 1 except for the $O(\sqrt{n})$ search time with 2-independent hashing.

Our results. Two intriguing questions pop out of this analysis. First, is the independence
of the query really crucial? Perhaps one could argue that the query behaves like an average
operation, even if it is not completely independent of everything else. Secondly, one has to wonder
whether 3-independence suffices (by using something other than the 3rd moment): all that is needed
is a bound slightly stronger than the 2nd moment in order to make the costs with increasing heights
decay geometrically!

We answer both questions in strong negative terms. The complete understanding of linear
probing with low independence is summarized in Table 1. Addressing the first question, we show
that 4-independence cannot give expected time per operation better than $\Omega(\lg n)$, even though $n$
operations take $O(n)$ time. Our proof demonstrates an important phenomenon: even though most
bins have low load, a particular key's hash code could be correlated with the (uniformly random)
choice of which bins have high load.

An even more striking illustration of this fact happens for 2-independence: the query time blows
up to $\Omega(\sqrt{n})$ in expectation, since we are left with no independence at all after conditioning
on the query's hash. This demonstrates a very large separation between linear probing and collision
chaining, which enjoys $O(1)$ query times even for 2-independent hash functions.

Addressing the second question, we show that 3-independence is not enough to guarantee even
a construction time of $O(n)$. Thus, in some sense, the 4th moment analysis is the best one can hope
for.

1.2 Technical Discussion: Minwise Independence 

This concept was introduced by two classic algorithms: detecting near-duplicate documents 0]
and approximating the size of the transitive closure [5]. The basic step in these algorithms is
estimating the size of the intersection of pairs of sets, relative to their union: for $A$ and $B$, we
want to find $\frac{|A \cap B|}{|A \cup B|}$ (the Jaccard similarity coefficient). To do this efficiently,
one can choose a hash function $h$ and maintain $\min h(A)$ as the sketch of an entire set $A$. If the
hash function is truly random, we have $\Pr[\min h(A) = \min h(B)] = \frac{|A \cap B|}{|A \cup B|}$. Thus,
by repeating with several hash functions, or by keeping the bottom $k$ keys with one hash function,
the Jaccard coefficient can be estimated up to a small approximation.
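The estimator based on repeating with several hash functions can be sketched as follows. This is an illustration only: the function name and parameters are hypothetical, and each "truly random" hash function is simulated by assigning an independent uniform random value to every key in $A \cup B$, which would not be practical for a large universe.

```python
import random

def estimate_jaccard(A, B, repetitions, rng=random):
    """Estimate |A & B| / |A | B| as the fraction of repetitions in which
    min h(A) == min h(B), drawing a fresh random h each time."""
    universe = list(A | B)
    hits = 0
    for _ in range(repetitions):
        # Stand-in for a truly random hash function on the keys seen so far.
        h = {x: rng.random() for x in universe}
        if min(h[x] for x in A) == min(h[x] for x in B):
            hits += 1
    return hits / repetitions

# Example: two sets of 20 keys sharing 10, so the true coefficient is 1/3.
A = set(range(0, 20))
B = set(range(10, 30))
estimate = estimate_jaccard(A, B, 2000, random.Random(0))
```

The two minima coincide exactly when the overall minimum of $A \cup B$ falls in $A \cap B$, which is why the hit frequency concentrates around the Jaccard coefficient.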

To make this idea work, the property that is required of the hash function is minwise
independence. Formally, a family of functions $\mathcal{H} = \{h : [u] \to [u]\}$ is said to be minwise
independent if, for any set $S \subseteq [u]$ and any $x \notin S$, we have
$\Pr_{h \in \mathcal{H}}[h(x) < \min h(S)] = \frac{1}{|S|+1}$. In other words, $x$ is the minimum of
$S \cup \{x\}$ only with its "fair" probability $\frac{1}{|S|+1}$.

As good implementations of exact minwise independent functions are not known, the definition is
relaxed to $\varepsilon$-minwise independent, where
$\Pr_{h \in \mathcal{H}}[h(x) < \min h(S)] = \frac{1 \pm \varepsilon}{|S|+1}$. Using such a function,
we will have $\Pr[\min h(A) = \min h(B)] = (1 \pm \varepsilon) \frac{|A \cap B|}{|A \cup B|}$. Thus, the
$\varepsilon$ parameter of the minwise family dictates the best approximation achievable in the
algorithms (which cannot be improved by repetition).

Indyk [10] gave the only implementation of minwise independence with provable guarantees,
showing that $O(\lg \frac{1}{\varepsilon})$-independent functions are $\varepsilon$-minwise independent.

His proof uses another tool enabled by $k$-independence: the inclusion-exclusion principle. Say
we want to bound the probability that at least one of $n$ events is "good." We can define
$p(k) = \sum_{S \subseteq [n], |S| = k} \Pr[\text{all of } S \text{ are good}]$. Then, the probability that
at least one event is good is, by inclusion-exclusion, $p(1) - p(2) + p(3) - p(4) + \cdots$. If we only
have $k$-independence ($k$ odd), we can upper bound the series by $p(1) - p(2) + \cdots + O(p(k))$. In the
common scenario that $p(k)$ decays exponentially with $k$, the trimmed series will only differ from the
full independence case by $2^{-\Omega(k)}$. Thus, $k$-independence achieves bounds exponentially close to
full independence, whenever probabilities can be computed by inclusion-exclusion. This turns out to
be the case for minwise independence: we can express the probability that at least some key in $S$ is
below $x$ by inclusion-exclusion.
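The behavior of the trimmed series can be illustrated numerically. The sketch below is a simplified setting chosen for illustration, not Indyk's argument: it assumes $n$ fully independent events that are each good with probability $q$, so that $p(j) = \binom{n}{j} q^j$, and compares partial sums with the exact probability $1 - (1 - q)^n$. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def truncated_union_probability(n, q, k):
    """Sum of the first k inclusion-exclusion terms p(1) - p(2) + ...,
    where p(j) = C(n, j) * q^j for n independent events of probability q."""
    return sum((-1) ** (j + 1) * comb(n, j) * q**j for j in range(1, k + 1))

n, q = 10, Fraction(1, 20)
exact = 1 - (1 - q) ** n
upper = truncated_union_probability(n, q, 3)  # odd cutoff: an upper bound
lower = truncated_union_probability(n, q, 4)  # even cutoff: a lower bound
```

Odd cutoffs overshoot and even cutoffs undershoot the exact value, and because $p(j)$ shrinks quickly here, the gap between consecutive partial sums closes fast.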

In this paper, we show that, for any $\varepsilon > 0$, there exist
$\Omega(\lg \frac{1}{\varepsilon})$-independent hash functions that are no better than
$\varepsilon$-minwise independent. Indyk's [10] simple analysis via inclusion-exclusion is
therefore tight: $\varepsilon$-minwise independence requires $\Omega(\lg \frac{1}{\varepsilon})$
independence.

2 Linear probing with $k$-independence

We will now present our analysis of linear probing with different degrees of independence. We 
will present negative results complementing the positive findings from [13], reviewed above. When 
lower bounding query times, for simplicity we assume that the query is not among the stored keys. 
The cost of such an unsuccessful search is the distance to the end of the run the query hashes to. 

2.1 Expected query time $\Omega(\sqrt{n})$ with 2-Independence

Above we saw that the expected construction time with 2-independence is $O(n \log n)$, so the average
cost per key is $O(\log n)$. We will now define a 2-independent hash family such that the expected
query time for some concrete key is $\Omega(\sqrt{n})$. The main idea of the proof is that the query can
play a special role: even if most portions of the hash table are lightly loaded, the query can be
correlated with the portions that are loaded. We assume that $t$ is an odd power of two, and we store
$n = t/2$ keys. Then $\sqrt{n}$ is also a power of two.

We think of the stored keys and the query key as given, and we want to find bad ways of 
distributing them 2-independently into the range [t]. To extend the hash function to the entire 
universe, all other keys are hashed totally randomly. We consider unsuccessful searches, i.e. the 
search key q is not stored in the hash table. The query time for q is the number of cells considered 
from $h(q)$ up to the first empty cell. If, for some $d$, the interval $Q = (h(q) - d, h(q)]$ has $2d$
keys, then the search time is $\Omega(d)$.

Let $d = 2\sqrt{n}$; this is a power of two dividing $t$. In our construction, we first pick the hash
$h(q)$ uniformly. We then divide the range into $\sqrt{n}$ intervals of length $d$, of the form
$(h(q) + i \cdot d, h(q) + (i + 1)d]$, wrapping around modulo $t$. One of these intervals is exactly $Q$.

We prescribe the distribution of keys between the intervals; the distribution within each interval
will be fully random. To place $2d = 4\sqrt{n}$ keys in the query interval with constant probability,
we mix among two strategies with constant probabilities (to be determined):

$S_1$: Spread keys evenly, with $\sqrt{n}$ keys in each interval.

$S_2$: Pick the query interval $Q$ and three random intervals. Place $4\sqrt{n}$ keys in one of these 4
intervals, and none in the others. All other intervals get $\sqrt{n}$ keys.

From the perspective of the stored keys, the 4 intervals are completely random. With
probability 1/4, it is $Q$ that gets $4\sqrt{n} = 2d$ keys, overloading it by a factor 2. Then, as
described above, the search time is $\Omega(\sqrt{n})$.

To prove that the hash function is 2-independent, we need to consider pairs of two stored 
keys, and pairs involving the query and one stored key. In either case, we can just look at the 
distribution into intervals, since the position within an interval is truly random. Moreover, by 
symmetry between intervals, we only need to understand the probability of the two keys landing 
in the same interval (which we call a "collision"). We need to balance the strategies so that the 
collision probability is exactly $1/\sqrt{n}$.

Since stored keys are symmetric, the probability of $q$ and $x$ colliding is $1/n$ times the expected
number of items in $Q$, which is exactly $\sqrt{n}$ with both strategies. Thus $h(q)$ and $h(x)$ are
independent no matter how we mix $S_1$ and $S_2$.

To analyze pairs $(x, y)$ of stored keys, we compute the expected number of collisions among
stored keys. We want this number to be $\binom{n}{2}/\sqrt{n} = \frac{1}{2} n^{1.5} - \frac{1}{2}\sqrt{n}$.
In strategy $S_1$, we get the smallest possible number of collisions:
$\sqrt{n}\binom{\sqrt{n}}{2} = \frac{1}{2} n^{1.5} - \frac{1}{2} n$. This is too few by almost $n/2$.
In strategy $S_2$, we get
$(\sqrt{n} - 4)\binom{\sqrt{n}}{2} + \binom{4\sqrt{n}}{2} = \frac{1}{2} n^{1.5} + \frac{11}{2} n$
collisions, which is too much by a bit more than $5.5n$. To get the right expected number of
collisions, we use $S_2$ with probability
$P_{S_2} = \frac{(0.5 \pm o(1))n}{(0.5 + 5.5 \pm o(1))n} = \frac{1}{12} \pm o(1)$. With this mix of
strategies, our hashing of keys is 2-independent, and since $P_{S_2} = \Omega(1)$, our expected search
cost is $\Omega(\sqrt{n})$.
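The balancing step can be checked with exact arithmetic. The sketch below recomputes the three collision counts from the formulas in the text and solves for the mixing probability; the function name is hypothetical, and $n$ is assumed to be a perfect square with $\sqrt{n} \ge 4$ so that strategy $S_2$ is well defined.

```python
from fractions import Fraction
from math import comb, isqrt

def mixing_probability(n):
    """Probability of choosing S_2 so that the expected number of colliding
    pairs among n stored keys equals C(n, 2) / sqrt(n)."""
    r = isqrt(n)
    assert r * r == n and r >= 4, "n must be a perfect square with sqrt(n) >= 4"
    target = Fraction(comb(n, 2), r)               # what 2-independence demands
    even = r * comb(r, 2)                          # S_1: sqrt(n) keys per interval
    skew = (r - 4) * comb(r, 2) + comb(4 * r, 2)   # S_2: one interval gets 4 sqrt(n)
    # Solve P * skew + (1 - P) * even == target for P.
    return (target - even) / (skew - even)

# Example: n = 256 stored keys, 16 intervals.
p_s2 = mixing_probability(256)
```

Solving the linear equation gives $\frac{1}{12} - \frac{1}{12\sqrt{n}}$ exactly, which matches the $\frac{1}{12} \pm o(1)$ stated above.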

Upper bound. We will now prove a matching upper bound of $O(\sqrt{n})$ on the expected query cost
with any 2-independent scheme. As in the lower bound, we hash $n$ keys into $[t]$, $t = 2n$, and
divide $[t]$ into $\sqrt{n}$ intervals of length $d = 2\sqrt{n}$. We view keys as colliding if they hash to
the same interval. We want to argue that long runs imply too many collisions for 2-independence.

The expected number of collisions is $\binom{n}{2}/\sqrt{n} = n^{3/2}/2 - n^{1/2}/2$. The minimum number
of collisions is with the distribution $S_1$ from the lower bound: a perfectly regular distribution with
$n/\sqrt{n} = \sqrt{n}$ keys in each interval, hence $\sqrt{n} \cdot \binom{\sqrt{n}}{2} = n^{3/2}/2 - n/2$
collisions in total.

An interval with $m$ keys has $m^2/2 - m/2$ collisions and the derivative is $m - 1/2$. It follows
that if we move a key from an interval with $m_1$ keys to one with $m_2 \ge m_1$ keys, the number
of collisions increases by more than $m_2 - m_1$. Any distribution can be obtained from the above
minimal distribution by moving keys from intervals with at most $\sqrt{n}$ keys to intervals with at
least $\sqrt{n}$ keys, and each such move increases the number of collisions.

A run of length $\ell\sqrt{n}$ implies that this many keys hash to an interval of this length. The run
is contained in less than $\ell/2 + 2$ of our length $d = 2\sqrt{n}$ intervals. In the process of creating
a distribution with this run from the minimum distribution, we have to move at least
$\ell\sqrt{n} - 1.5\sqrt{n}(\ell/2 + 2)$ keys to intervals that have already been filled with at least
$1.5\sqrt{n}$ keys. Keys are always moved from intervals with less than $\sqrt{n}$ keys, so each move gains
at least $0.5\sqrt{n}$ collisions. Thus our total gain is at least

$$0.5\sqrt{n}\,\big(\ell\sqrt{n} - 1.5\sqrt{n}(\ell/2 + 2)\big) = (\ell/8 - 1.5)\,n.$$

The total number of collisions with a run of length $\ell\sqrt{n}$ is therefore at least

$$n^{3/2}/2 - n/2 + (\ell/8 - 1.5)\,n = n^{3/2}/2 - 2n + \ell n/8.$$

It follows that if the expected run length is bigger than $16\sqrt{n}$, then the expected number of
collisions is bigger than $n^{3/2}/2$, contradicting that their expected number is only
$n^{3/2}/2 - n^{1/2}/2$.

2.2 Construction Time $\Omega(n \lg n)$ with 3-Independence

We will now construct a 3-independent family of hash functions, such that the time to insert $n$
items into a hash table is $\Omega(n \lg n)$. The lower bound is based on overflowing intervals.

Lemma 1. Suppose an interval $[a, b]$ of length $d$ has $d + \Delta$ stored keys hashing to it. Then the
insertion cost of these keys is $\Omega(\Delta^2)$.

Proof. The overflowing $\Delta$ keys will be part of a run containing $(b, b + \Delta]$. At least
$\lceil \Delta/2 \rceil$ of them must end at position $b + \lceil \Delta/2 \rceil$ or later, i.e., a
displacement of at least $\lceil \Delta/2 \rceil$. Interference from stored keys hashing outside $[a, b]$
can only increase the displacement, so the insertion cost is $\Omega(\Delta^2)$. □

We will add up such squared overflow costs over disjoint intervals, demonstrating an expected
total cost of $\Omega(n \lg n)$.

As before, we assume the array size $t$ is a power of two, and we set $n = \lceil \frac{2}{3} t \rceil$.
We imagine a perfect binary tree spanning the array. The root is level 0 and level $\ell$ is the nodes
at depth $\ell$. Our hash function will recursively distribute keys from a node to its two children,
starting at the root. Nodes run independent random distribution processes. Then, if each node makes a
$k$-independent distribution, overall the function is $k$-independent.

For a node, we mix between two strategies for distributing $2m$ keys between the two children:

$S_1$: Distribute the keys evenly between the children. If $2m$ is odd, a random child gets
$\lceil m \rceil$ keys.

$S_2$: Give all the keys to a random child.

Our goal is to determine the correct probability for the second strategy, $P_{S_2}$, such that the
distribution process is 3-independent. Then we will calculate the cost it induces on linear probing.
First, however, we need some basic facts about $k$-independence.

2.2.1 Characterizing $k$-Independence

Our randomized procedure treats keys symmetrically, and ignores the distinction between left/right
children. We call such distributions fully symmetric. Say the current node has to distribute $2m$
keys to its two children ($m$ need not be integral). Let $X_a$ be the indicator random variable for key
$a$ ending in the left child, and $X = \sum_a X_a$. By symmetry of the children, $E[X_a] = \frac{1}{2}$,
so $E[X] = m$. The $k$-th moment is $F_k = E[(X - m)^k]$. Also define
$p_k = \Pr[X_1 = \cdots = X_k = 1]$ (by symmetry, any $k$ distinct keys yield the same value).

Lemma 2. A fully symmetric distribution is $k$-independent iff $p_i = 2^{-i}$ for all $i = 2, \dots, k$.

Proof. For the non-trivial direction, assume $p_i = 2^{-i}$ for all $i = 2, \dots, k$. We need to show
that, for any $(x_1, \dots, x_k) \in \{0, 1\}^k$,
$\Pr[(X_1 = x_1) \wedge \cdots \wedge (X_k = x_k)] = 2^{-k}$. By symmetry of the keys, we
can sort the vector to $x_1 = \cdots = x_t = 1$ and $x_{t+1} = \cdots = x_k = 0$. Let $p_{k,t}$ be the
probability that such a vector is seen.

We use induction on $k$. In the base case, $p_{1,0} = p_{1,1} = \frac{1}{2}$ by symmetry. For $k \ge 2$, we
start with $p_{k,k} = p_k = 2^{-k}$. We then use induction for $t = k - 1$ down to $t = 0$. The induction
step is simply: $p_{k,t} = p_{k-1,t} - p_{k,t+1} = 2^{-(k-1)} - 2^{-k} = 2^{-k}$. Indeed,
$\Pr[X_{1..t} = 1 \wedge X_{t+1..k} = 0]$ can be computed as the difference between
$\Pr[X_{1..t} = 1 \wedge X_{t+1..k-1} = 0]$ (measured by $p_{k-1,t}$) and
$\Pr[X_{1..t} = 1 \wedge X_{t+1..k-1} = 0 \wedge X_k = 1]$ (measured by $p_{k,t+1}$). □

Based on this lemma, we can also give a characterization based on moments. First observe
that any odd moment is necessarily zero, as $\Pr[X = m + \delta] = \Pr[X = m - \delta]$ by symmetry of
the children.

Lemma 3. A fully symmetric distribution is $k$-independent iff its even moments up to $F_k$ coincide
with the moments of the truly random distribution.

Proof. We will show that $p_2, \dots, p_k$ are determined by $F_2, \dots, F_k$, and vice versa. Thus, any
distribution that has the same moments as a truly random distribution, will have the same values
$p_2, \dots, p_k$ as the truly random distribution ($p_i = 2^{-i}$ as in Lemma 2).

Let $n^{\underline{k}} = n(n - 1) \cdots (n - k + 1)$ be the falling factorial. The complete dependence
between $p_2, \dots, p_k$ and $F_2, \dots, F_k$ follows inductively from the following statement:

$$F_k = (2m)^{\underline{k}}\, p_k + f_k(m, p_2, \dots, p_{k-1}), \text{ for some function } f_k.$$

To see this, note that $F_k = E[(X - m)^k] = E[X^k] + f(m, E[X^2], \dots, E[X^{k-1}])$ for some function
$f$. But $E[X^k] = (2m)^{\underline{k}}\, p_k + f_1(m, k)\, p_{k-1} + f_2(m, k)\, p_{k-2} + \cdots$. Here, the
factors $f_i(m, k)$ count the number of ways to select $k$ out of $2m$ keys, with $i$ duplicates. □

2.2.2 Mixing the Strategies 

As a general convention, when we are mixing strategies $S_i$, we use $P_{S_i}$ to denote the probability
of picking strategy $S_i$ while we use a superscript $S_i$ to denote measures within strategy $S_i$, e.g.,
$F_2^{S_1}$ is the second moment when strategy $S_1$ is applied.

By Lemma 3, a mix of $S_1$ and $S_2$ is 3-independent iff it has the correct 2nd moment
$F_2 = \frac{m}{2}$. In strategy $S_1$, $X = m \pm 1$ (due to rounding errors if $2m$ is odd), so
$F_2^{S_1} \le 1$. In $S_2$ (all to one child), $|X - m| = m$ so $F_2^{S_2} = m^2$. For a correct 2nd moment
of $m/2$, we balance with $P_{S_2} = \frac{1}{2m} \pm O(\frac{1}{m^2})$.
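The mix can be checked against Lemma 2 with exact arithmetic in the case where $2m$ is even, so that $S_1$ involves no rounding and $P_{S_2} = \frac{1}{2m}$ exactly. The sketch below assumes that under $S_1$ the left child receives a uniformly random subset of $m$ of the $2m$ keys, which is one natural way to treat keys symmetrically; the function names are hypothetical.

```python
from fractions import Fraction

def falling(n, k):
    """Falling factorial n (n - 1) ... (n - k + 1)."""
    out = 1
    for i in range(k):
        out *= n - i
    return out

def p_all_left(m, k):
    """p_k = Pr[k fixed keys all go to the left child] when 2m keys (m an
    integer) are split with S_2 chosen with probability 1 / (2m)."""
    p_s2 = Fraction(1, 2 * m)
    # S_1 (assumed: left child gets a uniformly random m-subset of the keys).
    under_s1 = Fraction(falling(m, k), falling(2 * m, k))
    # S_2: all keys go to a random child, so they are all left w.p. 1/2.
    under_s2 = Fraction(1, 2)
    return p_s2 * under_s2 + (1 - p_s2) * under_s1

# Example with 2m = 8 keys.
p2, p3, p4 = (p_all_left(4, k) for k in (2, 3, 4))
```

In this example $p_2 = 2^{-2}$ and $p_3 = 2^{-3}$, as Lemma 2 requires for 3-independence, while $p_4 = \frac{3}{40}$ exceeds $2^{-4}$, so the mix is too clustered to be 4-independent.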

2.2.3 The Construction Cost of Linear Probing 

We now calculate the cost in terms of squared overflows. As long as the recursive steps spread the
keys evenly with $S_1$, the load factor stays around 2/3: at level $\ell$, the intervals have length
$2n/2^{\ell}$ and $2m = \frac{2}{3} \cdot n/2^{\ell} \pm 1$ keys. If now, for a node $v$ on level $\ell$, we
apply $S_2$ collecting all keys into one child, that child interval gets an overflow of
$\frac{1}{3} \cdot n/2^{\ell} \pm 1 = \Omega(m)$ keys. By Lemma 1, the keys at the child will have a total
insertion cost of $\Omega(m^2)$. Since $P_{S_2} = \Theta(1/m)$, the expected cost induced by $v$ is
$\Omega(m) = \Omega(n/2^{\ell})$. This, however, assumes that no ancestor of $v$ was collected. Note
that it also avoids over-counting when we only charge $v$ if $S_2$ collection is applied to $v$ but to no
ancestors of $v$, for then the nodes charged all represent different keys.

It remains to bound the probability that $S_2$ collection has been applied to an ancestor of a
node $v$ on a given level $\ell < (\log n)/2$. The collection probability for a node $u$ on level
$i < \ell$ is $P_{S_2} = \Theta(1/m) = \Theta(2^i/n)$ assuming no collection among the ancestors of $u$. By
the union bound, the probability that any ancestor $u$ of $v$ is first to be collected is
$\sum_{i=0}^{\ell-1} \Theta(2^i/n) = \Theta(2^{\ell}/n) = O(1/\sqrt{n}) = o(1)$. We conclude that $v$ has no
collected ancestors with probability $1 - o(1)$, hence that the expected cost of $v$ is
$\Omega(n/2^{\ell})$ as above. The total expected cost over all $2^{\ell}$ level $\ell$ nodes is thus
$\Omega(n)$. Summing over all levels $\ell < (\log n)/2$, we get an expected total insertion cost of
$\Omega(n \log n)$ for our 3-independent scheme.

2.3 Expected Query Time $\Omega(\lg n)$ with 4-Independence

Proving high expected search cost with 4-independence combines the ideas for 2-independence and
3-independence. However, some quite severe complications will arise. The lower bound is based on
overflowing intervals.

Lemma 4. Suppose an interval $[a, b]$ of length $d$ has $d + \Delta$, $\Delta = \Omega(d)$, stored keys
hashing to it. Assume that the interval has even length and that the stored keys hash symmetrically to
the first and second half of $[a, b]$. Moreover, assume that the query key hashes uniformly in $[a, b]$.
Then the expected query time is $\Omega(\Delta)$.

Proof. By symmetry between the first and the second half, with probability 1/2, the first half gets
at least half the keys, hence an overflow of $\Delta/2$ keys, and a run containing
$[a + d/2, a + d/2 + \Delta/2)$. Since $\Delta = \Omega(d)$, the probability that the query key hits the
first half of this run is $\Omega(1)$, and then the expected query cost is $\Omega(\Delta)$. □

As for 2-independence, we will first choose $h(q)$ and then make the stored keys cluster preferentially around $h(q)$. As for 3-independence, the distribution will be described using a perfectly balanced binary tree over $[t]$. The basic idea is to use the 3-independent distribution from Section 2.2 along the query path. For brevity, we call nodes on the query path query nodes. The overflows that lead to an $\Omega(n \log n)$ construction cost will yield an $\Omega(\log n)$ expected query time. However, the clustering of this 3-independent distribution is far too strong for 4-independence. We cannot really use it in the top of the tree, but further down, we can balance it with anti-clustering distributions at most of the nodes outside the query path.

2.3.1 3-independent Building Blocks 

For a node that has 2m keys to distribute, we consider three basic strategies: 

$S_1$: Distribute the keys evenly between the two children. If $2m$ is odd, a random child gets $\lceil m \rceil$ keys.

$S_2$: Give all the keys to a random child.

$S_3$: Pick a child randomly, and give it $m + \delta = \lceil m + \sqrt{m/2} \rceil$ keys.

By mixing among these, we define two super-strategies:
$T_1 = P_{S_2} \times S_2 + (1 - P_{S_2}) \times S_1$,
$T_2 = P_{S_3} \times S_3 + (1 - P_{S_3}) \times S_1$.
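The three basic strategies and the mixing notation can be sketched in a few lines of Python. This is only an illustration: the function names (`split_s1`, `split_s2`, `split_s3`, `mix`) are hypothetical, the cap at $2m$ in $S_3$ is an assumption added so that tiny nodes stay well defined, and the mixing probability passed to `mix` below is a placeholder rather than the exact value that makes the mix 3-independent.

```python
import math
import random

def split_s1(two_m, rng):
    """S1: split evenly; for an odd key count a random child gets the extra key."""
    left = two_m // 2
    if two_m % 2 == 1 and rng.random() < 0.5:
        left += 1
    return left, two_m - left

def split_s2(two_m, rng):
    """S2: give all keys to a random child."""
    left = two_m if rng.random() < 0.5 else 0
    return left, two_m - left

def split_s3(two_m, rng):
    """S3: a random child gets ceil(m + sqrt(m/2)) keys (capped at 2m, an assumption)."""
    m = two_m / 2
    big = min(two_m, math.ceil(m + math.sqrt(m / 2)))
    left = big if rng.random() < 0.5 else two_m - big
    return left, two_m - left

def mix(p, first, second):
    """Super-strategy: apply `first` with probability p, `second` otherwise."""
    def strategy(two_m, rng):
        chosen = first if rng.random() < p else second
        return chosen(two_m, rng)
    return strategy

rng = random.Random(1)
# Placeholder probability for illustration only, not the exact P_S2.
t1 = mix(1 / 200, split_s2, split_s1)
print(t1(200, rng))
```

Each strategy returns the pair (keys to the left child, keys to the right child), so a whole tree can be simulated by applying a strategy recursively to each child.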






The above notation means that strategy $T_1$ picks strategy $S_2$ with probability $P_{S_2}$; $S_1$ otherwise. Likewise $T_2$ picks $S_3$ with probability $P_{S_3}$; $S_1$ otherwise. The probabilities $P_{S_2}$ and $P_{S_3}$ are chosen such that $T_1$ and $T_2$ are 3-independent. The strategy $T_1$ is the 3-independent strategy from Section 2.2, where we determined $P_{S_2} = \frac{1}{2m} \pm O(\frac{1}{m^2})$. This will be our preferred strategy on the query path.

To compute $P_{S_3}$, we employ the 2nd moments: $F_2^{S_1} \le 1$ and $F_2^{S_3} = \frac{m}{2} + O(\sqrt{m})$. (If one ignored rounding, we would have the precise bounds $F_2^{S_1} = 0$ and $F_2^{S_3} = \frac{m}{2}$.) By Lemma 3 we need a 2nd moment of $m/2$. Thus, we have $P_{S_3} = 1 - O(\frac{1}{\sqrt{m}})$.

2.3.2 4-Independence on the Average, One Level at a Time

We are going to get 4-independence by an appropriate mix of our 3-independent strategies $T_1$ and $T_2$. Our first step is to hash the query uniformly into $[t]$. This defines the query path. We will do the mixing top-down, one level $\ell$ at a time. The individual node will not distribute its keys 4-independently. Nodes on the query path will prefer $T_1$ while nodes outside the query path will prefer $T_2$, all in a mix that leads to global 4-independence. There will also be neutral nodes for which we use a truly random distribution. Since all distributions are 3-independent regardless of the query path, the query hashes independently of any 3 stored keys. We are therefore only concerned about the 4-independence among stored keys.

It is tempting to try balancing $T_1$ and $T_2$ via 4th moments using Lemma 3. However, even on the same level $\ell$, the distribution of the number of keys at the node on the query path will be different from the distributions outside the query path, and this makes balancing via 4th moments non-obvious. Instead, we will argue independence via Lemma 2: since we already have 3-independence and all distributions are symmetric, we only need to show $p_4 = 2^{-4}$. Thus, conditioned on 4 given keys $a, b, c, d$ being together on level $\ell$, we want them all to go to the left child with probability $2^{-4}$. By symmetry, our 4-tuple $(a, b, c, d)$ is uniformly random among all 4-tuples surviving together on level $\ell$. On the average we thus want such 4-tuples to go left together with probability $2^{-4}$.

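2.3.3 Analyzing $T_1$ and $T_2$

Our aim now is to compute $p_4^{T_1}$ and $p_4^{T_2}$ for a node with $2m$ keys to be split between its children. First we note:

$p_4^{S_1} = m^{\underline{4}}/(2m)^{\underline{4}} = \frac{1}{2^4}\left(1 - \frac{3 \pm o(1)}{m}\right)$

Indeed, the first key will go to the left child with probability $\frac{1}{2} = \frac{m}{2m}$. Conditioned on this, the second key will go to the left child with probability $\frac{m-1}{2m-1}$, etc. In $S_2$, all keys go to the left child with probability a half, so $p_4^{S_2} = \frac{1}{2}$. Since $P_{S_2} = \frac{1}{2m} \pm O(\frac{1}{m^2})$, we get

$p_4^{T_1} = P_{S_2} \cdot p_4^{S_2} + (1 - P_{S_2}) \cdot p_4^{S_1} = \frac{1}{2^4}\left(1 + \frac{1 \pm o(1)}{2m}\right) = 2^{-4} + \Theta(1/m)$.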

To avoid a rather involved calculation, we will not derive $p_4^{T_2}$ directly, but rather as a function of the 4th moment. We have $F_4^{S_1} \le 1$, $F_4^{S_3} = \delta^4 = \frac{1}{4}m^2 + O(m^{3/2})$, and $P_{S_3} = 1 - O(\frac{1}{\sqrt{m}})$, so

$F_4^{T_2} = P_{S_3} F_4^{S_3} + (1 - P_{S_3}) F_4^{S_1} = \frac{1}{4}m^2 \pm O(m^{3/2})$.

From the proof of Lemma 3 we know that $F_4 = (2m)^{\underline{4}} p_4 - f(m, p_2, p_3)$ with any distribution. Since $T_2$ is 3-independent, it has the same $p_2$ and $p_3$ as a truly random distribution. Thus, we can compute $f(m, p_2, p_3)$ using the $F_4$ and $p_4$ values of a truly random distribution. The 4th moment of a truly random distribution is:

$F_4 = \frac{2m}{2^4} + \binom{4}{2}\frac{(2m)^{\underline{2}}}{2^4} = \frac{24m^2 - 10m}{2^4}$

Since $p_4 = 2^{-4}$ in the truly random case, we have $f(m, p_2, p_3) = 2^{-4}\left[(2m)^{\underline{4}} - (24m^2 - 10m)\right]$. Now we can return to $p_4^{T_2}$:

$p_4^{T_2} = \frac{F_4^{T_2} + f(m, p_2, p_3)}{(2m)^{\underline{4}}} = \frac{1}{2^4}\left(\frac{4m^2 \pm O(m^{3/2})}{(2m)^{\underline{4}}} + 1 - \frac{24m^2 - 10m}{(2m)^{\underline{4}}}\right)$

$= \frac{1}{2^4}\left(1 - \frac{20m^2 \pm O(m^{3/2})}{(2m)^{\underline{4}}}\right) = \frac{1}{2^4}\left(1 - \frac{5}{4m^2} \pm O\left(\frac{1}{m^{2.5}}\right)\right) = 2^{-4} - \Theta(1/m^2)$.

To get $p_4 = 2^{-4}$ for a given node, we use a strategy $T^*$ that applies $T_1$ with probability $P^* \approx 5/(2m)$; $T_2$ otherwise. However, as stated earlier, we will often give preference to $T_1$ on the query path, and to $T_2$ elsewhere.



2.3.4 The Distribution Tree 

We are now ready to describe the mix of strategies used in the binary tree. On the top $\frac{2}{3}\log_2 t$ levels, we use the above mentioned mix $T^*$ of $T_1$ and $T_2$, yielding a perfect 4-independent distribution of the keys at each node.

On the next levels $\ell \ge \frac{2}{3}\log_2 t$, we will always use $T_1$ on the query path. For the other nodes, we use $T_1$ with the probability $P^{T_1}_{\ell}$ such that if all non-query nodes on level $\ell$ use the strategy

$T^-_{\ell} = P^{T_1}_{\ell} \times T_1 + (1 - P^{T_1}_{\ell}) \times T_2$,

then we get $p_4 = 2^{-4}$ for an average 4-tuple on level $\ell$. We note that $P^{T_1}_{\ell}$ depends completely on the distribution of 4-tuples at the nodes on level $\ell$ and that $P^{T_1}_{\ell}$ has to compensate for the fact that $T_1$ is used at the query node. We shall prove the existence of $P^{T_1}_{\ell}$ shortly.

Finally, we have a stopping criterion: if at some level $\ell$, we use the $S_2$ collection on the query path, or if $\ell + 1 > \frac{5}{6}\log_2 t$, then we use a truly random distribution on all subsequent levels. We note that the $S_2$ collection could happen already on a top level $\ell < \frac{2}{3}\log_2 t$.



2.3.5 Possibility of Balance 

Consider a level $\ell$ before the stopping criterion has been applied. We need to argue that the above mentioned probability $P^{T_1}_{\ell}$ exists. We will argue that $P^{T_1}_{\ell} = 0$ implies an average $p_4 < 2^{-4}$ while $P^{T_1}_{\ell} = 1$ implies an average $p_4 > 2^{-4}$. Then continuity implies that there exists a $P^{T_1}_{\ell} \in [0, 1]$ yielding $p_4 = 2^{-4}$.

With $P^{T_1}_{\ell} = 1$, we use strategy $T_1$ for all nodes on the level, and we already know that $p_4^{T_1} > 2^{-4}$.

Now consider $P^{T_1}_{\ell} = 0$, that is, we use $T_1$ only at the query node. Starting with a simplistic calculation, assume that all $2^{\ell}$ nodes on level $\ell$ had exactly $2m = n/2^{\ell}$ keys, hence the same number of 4-tuples. Then the average is

$\frac{p_4^{T_1} + (2^{\ell} - 1)\,p_4^{T_2}}{2^{\ell}} = \frac{2^{-4} + \Theta(1/m) + (2^{\ell} - 1)(2^{-4} - \Theta(1/m^2))}{2^{\ell}} < 2^{-4}$.

The inequality follows because $\ell \ge \frac{2}{3}\log_2 t$ implies $2^{\ell} > n^{2/3}$ while $m < n/2^{\ell} < n^{1/3}$. However, the number of keys at different nodes on level $\ell$ is not expected to be the same, and we will handle this below.

We want to prove that the average $p_4$ over all 4-tuples on level $\ell$ is below $2^{-4}$. To simplify calculations, we can add $p_4^{T_1} - 2^{-4} = \Theta(1/m)$ for each 4-tuple using $T_1$ and $p_4^{T_2} - 2^{-4} = -\Theta(1/m^2)$ for each tuple using $T_2$, and show that the sum is negative. If the query node has $2m$ keys, all using $T_1$, we thus add $(2m)^{\underline{4}}\,\Theta(1/m) = \Theta(m^3)$. If a non-query node has $2m$ keys, we subtract $(2m)^{\underline{4}}\,\Theta(1/m^2) = \Theta(m^2)$.

We now want to bound the number of keys at the level $\ell$ query node. Since the stopping criterion has not applied, we know that $S_2$ collection has not been applied to any of its ancestors.

Lemma 5. If we have never applied $S_2$ collection on the path to a query node $v$ on level $j \le \frac{5}{6}\log_2 t$, then $v$ has $n/2^j \pm 3\sqrt{n/2^j}$ keys.

Proof. On the path to $v$, we have only applied strategies $S_1$ and $S_3$. Hence, if an ancestor of $v$ has $2m$ keys, then each child gets $m \pm (\sqrt{m/2} + 1)$ keys. The bound follows by induction starting with $2m = n$ keys at the root on level 0. □

Our level $\ell$ query node thus has $\Theta(n/2^{\ell})$ keys and contributes $O((n/2^{\ell})^3)$ to the sum.

To lower bound the negative contribution from the non-query nodes on level $\ell$, we first note that they share all the $n - O(n/2^{\ell}) = \Omega(n)$ keys not on the query path. The negative contribution for a node with $2m$ keys is $\Omega(m^2)$. By convexity, the total negative contribution is minimized if the keys are evenly spread among the $2^{\ell} - 1$ non-query nodes, and it is even smaller if we distribute them over $2^{\ell}$ nodes. The total negative contribution is therefore at least $2^{\ell}\,\Omega((n/2^{\ell})^2) = \Omega(n^2/2^{\ell})$. This dominates the positive contribution from the query node since $(2^{\ell})^2 > n^{4/3} = \omega(n)$. Thus we conclude that the average $p_4 < 2^{-4}$ when $P^{T_1}_{\ell} = 0$. This completes the proof that for level $\ell$ we can find a value of $P^{T_1}_{\ell} \in [0, 1]$ such that $p_4 = 2^{-4}$, hence the proof that the distribution tree described in Section 2.3.4 exists, hashing all keys 4-independently.

2.3.6 Expected Query Time 

We will now study the expected query time. We only consider the cost in the event that $S_2$ collection is applied at the query node at some level $\ell \in [\frac{2}{3}\log_2 t, \frac{5}{6}\log_2 t]$. This implies that $S_2$ has not been applied previously on the query path, so the event can only happen once with a given distribution (no over-counting). By Lemma 5, our query node has $n/2^{\ell} \pm 3\sqrt{n/2^{\ell}}$ keys. With probability 1/2, these all go to the query child, which represents an interval of length $t/2^{\ell+1}$. Since $n = 2t/3$, we conclude that the query child gets overloaded by almost a factor 4/3. By Lemma 4, the expected search cost is then $\Omega(n/2^{\ell})$.

On the query path on every level $i \le \ell$, we know that the probability of applying $S_2$, provided that $S_2$ has not already been applied, is $\Theta(1/m)$ where $m = \Theta(n/2^i)$ by Lemma 5. The probability of applying $S_2$ on level $\ell \in [\frac{2}{3}\log_2 t, \frac{5}{6}\log_2 t]$ is therefore $(1 - O(2^{\ell}/n))\,\Theta(2^{\ell}/n) = \Theta(2^{\ell}/n)$, so the expected search cost from this level is $\Omega(1)$. Since our event can only happen on one level for a given distribution, we sum this cost over the $\Omega(\log n)$ levels in $[\frac{2}{3}\log_2 t, \frac{5}{6}\log_2 t]$. We conclude that the expected search cost of our 4-independent scheme is $\Omega(\log n)$.

3 Minwise Independence via k-Independence



We will show that there is a limit to how good a minwise independence we can achieve based on $k$-independent hashing. For a given $k$, our goal is to construct a $k$-independent distribution over $n$ regular keys and a query key $q$, such that the probability that $q$ gets the minimal hash value is $\frac{1}{n+1}(1 + 2^{-O(k)})$.

We assume that $k$ is even and divides $n$. Each hash value will be uniformly distributed in the unit interval $[0,1)$. Discretizing this continuous interval does not affect any of the calculations below, as long as precision $2 \lg n$ or more is used (making the probability of a non-unique minimum vanishingly small).

For our construction, we divide the unit interval into $\frac{n}{k}$ subintervals of the form $[i\frac{k}{n}, (i+1)\frac{k}{n})$. The regular keys are distributed totally randomly between these subintervals. Each subinterval $I$ gets $k$ regular keys in expectation. We say that $I$ is exact if it gets exactly $k$ regular keys. Whenever $I$ is not exact, the regular keys are placed totally randomly within it.

The distribution inside an exact interval $I$ is dictated by a parity parameter $P \in \{0, 1\}$. We break $I$ into two equal halves, and distribute the $k$ keys into these halves randomly, conditioned on the parity of the number of keys in the first half being $P$. Within its half, each key gets an independent random value. If $P$ is fixed, this process is $(k-1)$-independent. Indeed, one can always deduce the half of a key $x$ based on knowledge of $k-1$ keys, but the location of $x$ is totally uniform if we only know about $k-2$ keys. If the parity parameter $P$ is uniform in $\{0, 1\}$ (but possibly dependent among exact intervals), the overall distribution is still $k$-independent.

The query is generated independently and uniformly. For each exact interval $I$, if the query is inside it, we set its parity parameter $P_I = 0$. If $I$ is exact but the query is outside it, we toss a biased coin to determine the parity, with $\Pr[P_I = 0] = (\frac{1}{2} - \frac{k}{n})/(1 - \frac{k}{n})$. Any fixed exact interval receives the query with probability $\frac{k}{n}$, so overall the distribution of $P_I$ is uniform.
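The claim that the parity parameter is uniform overall is a one-line total-probability computation, sketched here with exact fractions. The function name is hypothetical, and the sketch assumes $k/n \le 1/2$ so that the biased coin is a valid probability.

```python
from fractions import Fraction

def parity_zero_probability(n, k):
    """Overall Pr[P_I = 0] for a fixed exact interval, by total probability."""
    hit = Fraction(k, n)                       # query lands in this interval
    coin = (Fraction(1, 2) - hit) / (1 - hit)  # biased coin used otherwise
    assert 0 <= coin <= 1, "requires k/n <= 1/2"
    return hit * 1 + (1 - hit) * coin

print(parity_zero_probability(1000, 10))
```

The result does not depend on the particular $n$ and $k$ chosen, as long as the coin probability is well defined.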

We claim that the overall process is $k$-independent. Uniformity of $P_I$ implies that the distribution of regular keys is $k$-independent. In the case of $q$ and $k-1$ regular keys, we also have full independence, since the distribution in an interval is $(k-1)$-independent even conditioned on $P$.

It remains to calculate the probability of $q$ being the minimum under this distribution. First we assume that the query landed in an exact interval $I$, and calculate $p_{\min}$, the probability that $q$ takes the minimum value within $I$. Define the random variable $X$ as the number of regular keys in the first half. By our process, $X$ is always even.

If $X = x > 0$, $q$ is the minimum only if it lands in the first half (probability $\frac{1}{2}$) and is smaller than the $x$ keys already there (probability $\frac{1}{x+1}$). If $X = 0$, $q$ is the minimum either if it lands in the first half (probability $\frac{1}{2}$), or if it lands in the second half, but is smaller than everybody there (probability $\frac{1}{2(k+1)}$). Thus,

$p_{\min} = \Pr[X = 0] \cdot \left(\frac{1}{2} + \frac{1}{2(k+1)}\right) + \sum_{x=2,4,..,k} \Pr[X = x] \cdot \frac{1}{2(x+1)}$

To compute $\Pr[X = x]$, we can think of the distribution into halves as a two step process: first $k-1$ keys are distributed randomly; then, the last key is placed to make the parity of the first half even. Thus, $X = x$ if either $x$ or $x-1$ of the first $k-1$ keys landed in the first half. In other words:

$\Pr[X = x] = \binom{k-1}{x}/2^{k-1} + \binom{k-1}{x-1}/2^{k-1} = \binom{k}{x}/2^{k-1}$

No keys are placed in the first half iff none of the first $k-1$ keys land there; thus $\Pr[X = 0] = 1/2^{k-1}$. We obtain:

$p_{\min} = \frac{1}{2^k(k+1)} + \frac{1}{2^k}\sum_{x=0,2,..,k} \binom{k}{x}\frac{1}{x+1}$

But $\binom{k}{x}\frac{1}{x+1} = \frac{1}{k+1}\binom{k+1}{x+1}$. Since $x + 1$ is odd, the sum over all odd binomial coefficients is exactly $2^{k+1}/2$ (it is equal to the sum over even binomial coefficients, and half the total). Thus, $p_{\min} = \frac{1}{2^k(k+1)} + \frac{1}{k+1}$, i.e. $q$ is the minimum with a probability that is too large by a factor of $1 + 2^{-k}$.
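The closed form for $p_{\min}$ can be verified exactly for small even $k$ by summing over the possible values of $X$. This is a minimal sketch with a hypothetical function name; it uses $\Pr[X = x] = \binom{k}{x}/2^{k-1}$ for even $x$ and the two cases for $q$ being the minimum described above.

```python
from fractions import Fraction
from math import comb

def p_min(k):
    """Probability that q is the minimum within an exact interval with parity 0."""
    total = Fraction(0)
    for x in range(0, k + 1, 2):               # X is always even
        pr_x = Fraction(comb(k, x), 2 ** (k - 1))
        if x == 0:
            # q wins in the empty first half, or beats all k keys in the second half
            win = Fraction(1, 2) + Fraction(1, 2 * (k + 1))
        else:
            # q must land in the first half and beat the x keys there
            win = Fraction(1, 2) * Fraction(1, x + 1)
        total += pr_x * win
    return total

for k in (2, 4, 8):
    print(k, p_min(k), Fraction(1, k + 1) * (1 + Fraction(1, 2 ** k)))
```

For each even $k$ tried, the two printed fractions agree, matching the factor $1 + 2^{-k}$ over the fair probability $\frac{1}{k+1}$.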

We are now almost done. For $q$ to be the minimum of all keys, it has to be in the minimum non-empty interval. If this interval is exact, our distribution increases the chance that $q$ is minimum by a factor $1 + 2^{-k}$; otherwise, our distribution is completely random in the interval, so $q$ is minimum with its fair probability. Let $Z$ be the number of regular keys in $q$'s interval, and let $\mathcal{E}$ be the event that $q$'s interval is the minimum non-empty interval. If the distribution were truly random, then $q$ would be minimum with probability:

$\frac{1}{n+1} = \sum_{z} \Pr[Z = z] \cdot \Pr[\mathcal{E} \mid Z = z] \cdot \frac{1}{z+1}$

In our tweaked distribution, $q$ is minimum with probability:

$\sum_{z \ne k} \Pr[Z = z] \cdot \Pr[\mathcal{E} \mid Z = z] \cdot \frac{1}{z+1} + \Pr[Z = k] \cdot \Pr[\mathcal{E} \mid Z = k] \cdot \frac{1 + 2^{-k}}{k+1}$

$= \frac{1}{n+1} + \Pr[Z = k] \cdot \Pr[\mathcal{E} \mid Z = k] \cdot \frac{2^{-k}}{k+1}$

But $Z$ is binomially distributed with $n$ trials and mean $k$; thus $\Pr[Z = k] = \Omega(1/\sqrt{k})$. Furthermore, $\Pr[\mathcal{E} \mid Z = k] \ge \frac{k}{n}$, since $q$'s interval is the very first with probability $\frac{k}{n}$ (and there is also a nonzero chance that it is not the first, but all intervals before it are empty). Thus, the probability is off by an additive term $\Omega\left(\frac{2^{-k}}{n\sqrt{k}}\right)$. This translates into a multiplicative factor of $1 + 2^{-O(k)}$.



4 Multiply-Shift and Linear Probing

We show that the simplest and fastest known universal hashing schemes have bad expected performance when used for linear probing on some of the most realistic structured data. This result is inspired by negative experimental findings from [22]. The schemes considered have the following basic form: we want to hash $\ell_{in}$-bit keys into $\ell_{out}$-bit indices. Here $\ell_{in} \ge \ell_{out}$, and the indices are used for the linear probing array. For the typical case of a half full table, we have $2^{\ell_{out}} = m \approx 2n$. In particular, $m > n$.

Depending on details of the scheme, for some $\ell \ge \ell_{in}, \ell_{out}$, we pick a random multiplier $a \in [2^{\ell}]$, and compute

$h_a(x) = (ax \bmod 2^{\ell}) \div 2^{\ell - \ell_{out}}$

We refer to this as the basic multiply-shift scheme. If $\ell \in \{8, 16, 32, 64\}$, the mod-operation is performed automatically as discarded overflow. The $\div$ operation is just a right shift by $s = \ell - \ell_{out}$, so in C we get the simple code (a*x)>>s and the cost is dominated by a single multiplication. For the plain universal hashing in [7], it suffices that $\ell \ge \ell_{in}$ but then the multiplier $a$ should be odd. For 2-independent hashing as in [6], we need $\ell \ge \ell_{in} + \ell_{out} - 1$. Also we need to add a random number $b$, but as we shall discuss in the end, these details have no essential impact on the derivation below.

[Figure 1: the case studied in Lemma 6, drawn on the cyclic unit interval; the caption is only partly recoverable from the extraction, and the surviving label is "+/- 1/2".]
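As a minimal sketch of the basic multiply-shift scheme, the following Python functions make the word size explicit, since Python integers do not overflow the way fixed-width C integers do. The function names and the parameter names `ell` and `ell_out` are illustrative choices, not taken from any library.

```python
def multiply_shift(a, x, ell, ell_out):
    """Basic multiply-shift: (a*x mod 2^ell) shifted right by ell - ell_out bits."""
    mask = (1 << ell) - 1            # emulates discarded overflow of an ell-bit word
    return ((a * x) & mask) >> (ell - ell_out)

def multiply_add_shift(a, b, x, ell, ell_out):
    """Variant with an added offset b, as used for the 2-independent scheme."""
    mask = (1 << ell) - 1
    return ((a * x + b) & mask) >> (ell - ell_out)

# Hash a few consecutive keys into 8-bit indices using a 16-bit word.
print([multiply_shift(0x9E37, x, 16, 8) for x in range(5)])
```

The multiplier `0x9E37` is an arbitrary odd constant chosen for the demonstration; in the schemes discussed here the multiplier is drawn at random.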

Our basic bad example will be where the keys form the interval $[n] = \{0, \ldots, n-1\}$. However, the problem will not go away if this interval is shifted or not totally full, or replaced by an arithmetic progression.

When analyzing the scheme, it is convenient to first consider it as a mapping into the unit interval $[0,1)$ via

$h^0_a(x) = (ax \bmod 2^{\ell})/2^{\ell}$.

Then $h_a(x) = \lfloor h^0_a(x)\,2^{\ell_{out}} \rfloor$. We think of the unit interval as circular, and for any $x \in [0,1)$, we define

$\|x\| = \min\{x \bmod 1, -x \bmod 1\}$.

This is the distance from 0 in the circular unit interval.

Lemma 6. Let the multiplier $a$ be given and suppose for some $x \in \{1, \ldots, n-1\}$ that $\|h^0_a(x)\| \le 1/(2m)$. Then, when we use $h_a$ to hash $[n]$ into a linear probing table, the average cost per key is $\Omega(n/x)$.

Proof. The case studied is illustrated in Figure 1. For each $k \in [x]$, consider the set $[n]^x_k = \{y \in [n] \mid y \equiv k \pmod{x}\}$. The $q \approx n/x$ keys from $[n]^x_k$ map to an interval of length $(q-1)/(2m)$, which means that $h_a$ distributes $[n]^x_k$ on at most $\lceil q/2 \rceil + 1$ consecutive array locations. Linear probing will have to spread $[n]^x_k$ on $q$ locations, so on the average, the keys will get a displacement of $\Omega(q) = \Omega(n/x)$. We get a corresponding double full interval for every equivalence class modulo $x$. Therefore we get an average insert cost of $\Omega(n/x)$ over all the keys. The above average cost only measures the interaction between keys from the same equivalence class modulo $x$. If some of these classes overlapped, the cost would only grow. □
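The effect described in Lemma 6 is easy to observe in a small simulation. The sketch below inserts the keys $0, \ldots, n-1$ into a linear probing table and reports the average displacement; the helper names are hypothetical, and the parameters (a 16-bit word, 8-bit indices, $n = 128$, $m = 256$) are chosen only for illustration. The multiplier $a = 1$ is the extreme bad case with $x = 1$: every key hashes to the same location.

```python
def multiply_shift(a, x, ell, ell_out):
    """Basic multiply-shift hash on an ell-bit word."""
    return ((a * x) & ((1 << ell) - 1)) >> (ell - ell_out)

def average_displacement(keys, hash_fn, m):
    """Insert keys by linear probing into a table of size m; return mean displacement."""
    table = [None] * m
    total = 0
    count = 0
    for key in keys:
        home = hash_fn(key)
        step = 0
        while table[(home + step) % m] is not None:
            step += 1                      # probe the next location cyclically
        table[(home + step) % m] = key
        total += step
        count += 1
    return total / count

n, m, ell, ell_out = 128, 256, 16, 8
bad = average_displacement(range(n), lambda x: multiply_shift(1, x, ell, ell_out), m)
other = average_displacement(range(n), lambda x: multiply_shift(0x9E37, x, ell, ell_out), m)
print(bad, other)
```

With $a = 1$ the $i$-th inserted key is displaced by exactly $i$ positions, so the average grows linearly in $n$; the second multiplier is an arbitrary odd constant included only for comparison.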

Note that $\|h^0_a(x)\| \le 1/(2m)$ implies that $h^0_a(x)$ is contained in an interval of size $1/m$ around 0. From the universality arguments of [6] we know that the probability of this event is roughly $1/m$ (we shall return with an exact statement and proof later). We would like to conclude that the expected average cost is $\sum_{x=1}^{n-1} \Omega(n/x)/m = \Omega(\lg n)$. The answer is correct, but the calculation cheats in the sense that we may have many different $x$ such that $\|h^0_a(x)\| \le 1/(2m)$, and the associated costs should not all be added up.

To get a proper lower bound, for any given multiplier $a$, we let $\mu_a$ denote the minimal positive value such that $\|h^0_a(\mu_a)\| \le 1/(2m)$. If $\mu_a < n$, then by Lemma 6, the average insertion cost is $\Omega(n/\mu_a)$. Therefore, if $a$ is random over some probability distribution (to be played with as we go along), the expected average insert cost is lower bounded by

$\Omega\left(\sum_{x=1}^{n-1} n/x \cdot \Pr[x = \mu_a]\right)$. (1)

Lemma 7. For a given multiplier $a$, consider any $x < n$ such that $\|h^0_a(x)\| \le 1/(2m)$. Then $x \ne \mu_a$ if and only if for some prime factor $p$ of $x$, $\|h^0_a(x/p)\| \le 1/(2pm)$.

Proof. The "if" part is trivial. Since $\|h^0_a(\mu_a)\| \le 1/(2m)$, for any integer $i < m$, we have $\|h^0_a(i\mu_a)\| = i\|h^0_a(\mu_a)\|$. Hence $\|h^0_a(i\mu_a)\| \le 1/(2m) \iff \|h^0_a(\mu_a)\| \le 1/(2im)$. On the other hand, suppose $\|h^0_a(y)\| \le 1/(2m)$ where $y$ is not a multiple of $\mu_a$. Then $h^0_a$ maps $\{0, \ldots, y + \mu_a - 1\}$ to points in the cyclic unit interval that are at most $1/(2m)$ apart (c.f. Figure 1). It follows that $y \ge 2m - \mu_a$. However, we are only considering $x < n$ with $\|h^0_a(x)\| \le 1/(2m)$. If $x \ne \mu_a$, then $\mu_a < x < n < m$ while $y \ge 2m - \mu_a > n$, so we conclude that $x$ is a multiple of $\mu_a$. Then $x = i\mu_a$ where $\|h^0_a(\mu_a)\| \le 1/(2im)$. Let $p$ be any prime factor of $i$. Then $\|h^0_a(x/p)\| = \|h^0_a((i/p)\mu_a)\| = (i/p)\|h^0_a(\mu_a)\| \le 1/(2pm)$. □
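For concreteness, the circular distance and the quantity $\mu_a$ can be computed by brute force for small parameters. This is a sketch with hypothetical function names; it uses exact fractions so that comparisons with $1/(2m)$ are not affected by floating-point rounding, and it searches only below a caller-supplied limit.

```python
from fractions import Fraction

def circular_norm(a, x, ell):
    """Distance of (a*x mod 2^ell)/2^ell from 0 on the circular unit interval."""
    size = 1 << ell
    r = (a * x) % size
    return Fraction(min(r, size - r), size)

def mu(a, ell, m, limit):
    """Smallest positive x < limit with circular norm at most 1/(2m), or None."""
    bound = Fraction(1, 2 * m)
    for x in range(1, limit):
        if circular_norm(a, x, ell) <= bound:
            return x
    return None

# With a 16-bit word, m = 256 and n = 128:
print(mu(1, 16, 256, 128))            # a = 1 maps x = 1 very close to 0
print(mu(2 ** 15 + 1, 16, 256, 128))  # here x = 1 lands near 1/2, but x = 2 lands near 0
```

A small value of $\mu_a$ corresponds, by Lemma 6, to a large average insertion cost $\Omega(n/\mu_a)$.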

To illustrate the basic accounting idea, assume for simplicity that we have a perfect distribution $\mathcal{U}$ on $a$ that for any fixed $x > 0$ distributes $h^0_a(x)$ uniformly in the unit interval. Then for any $x$ and $\varepsilon \le 1/2$,

$\Pr_{a \leftarrow \mathcal{U}}[\|h^0_a(x)\| \le \varepsilon] = 2\varepsilon$. (2)

Then by Lemma 7,

$\Pr_{a \leftarrow \mathcal{U}}[x = \mu_a] \ge \Pr[\|h^0_a(x)\| \le 1/(2m)] - \sum_{p \text{ prime factor of } x} \Pr[\|h^0_a(x/p)\| \le 1/(2pm)]$

$= 1/m - \sum_{p \text{ prime factor of } x} 1/(pm)$

$= \left(1 - \sum_{p \text{ prime factor of } x} 1/p\right)/m$. (3)

We note that the lower bound (3) may be negative since there are values of $x$ for which $\sum_{p \text{ prime factor of } x} 1/p = \Theta(\lg\lg x)$. Nevertheless (3) suffices with an appropriate reordering of terms. From (1) we get that the expected average insertion cost is:

$\Omega\left(\sum_{x=1}^{n-1} n/x \cdot \Pr[x = \mu_a]\right) = \Omega\left(\sum_{x=1}^{n-1} \frac{n}{xm}\left(1 - \sum_{\text{prime factor } p \text{ of } x} 1/p\right)\right) = \Omega\left(\sum_{x=1}^{n-1} \frac{n}{xm}\left(1 - \sum_{\text{prime } p=2,3,5,..} 1/p^2\right)\right)$

Above we simply moved terms of the form $-n/(xmp)$ where $p$ is a prime factor of $x$ to $x' = x/p$ in the form $-n/(x'mp^2)$. Conservatively, we include $-n/(x'mp^2)$ for all primes $p$ even if $px' \ge n$. Since $\sum_{\text{prime } p=2,3,5,..} 1/p^2 < 0.453$, we get an expected average insertion cost of

$\Omega\left(\sum_{x=1}^{n-1} n/x \cdot \Pr[x = \mu_a]\right) = \Omega\left(\sum_{x=1}^{n-1} 0.547\,n/(xm)\right) = \Omega(n/m \lg n)$.
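The two numerical constants used in this accounting, the sum of $1/p^2$ over all primes and over odd primes only, can be checked numerically. The sketch below sums over primes up to a cutoff of $10^5$ (an arbitrary choice; the neglected tail is smaller than $10^{-5}$), using a simple sieve with a hypothetical helper name.

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes returning all primes <= limit."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            # clear all multiples of p starting at p*p
            sieve[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return [i for i, flag in enumerate(sieve) if flag]

primes = primes_up_to(100_000)
all_primes_sum = sum(1 / (p * p) for p in primes)
odd_primes_sum = sum(1 / (p * p) for p in primes if p > 2)
print(all_primes_sum, odd_primes_sum)
```

The partial sums stay below 0.453 and 0.203 respectively, consistent with the constants $0.547 = 1 - 0.453$ and $0.594 = 1 - 2 \cdot 0.203$ that appear in the bounds.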



We would now be done if we had the perfect distribution $\mathcal{U}$ on $a$ so that the equality (2) was satisfied. Instead we will use the weaker statements of the following lemma:

Lemma 8. Let $\mathcal{O}$ be the uniform distribution on odd $\ell$-bit numbers. For any odd $x < n$ and $\varepsilon \le 1/2$,

$\Pr_{a \leftarrow \mathcal{O}}[\|h^0_a(x)\| \le \varepsilon] \le 4\varepsilon$. (4)

However, if $\varepsilon$ is an integer multiple of $1/2^{\ell-1}$, then

$\Pr_{a \leftarrow \mathcal{O}}[\|h^0_a(x)\| \le \varepsilon] = 2\varepsilon$. (5)

Proof. When $x$ is odd and $a$ is a uniformly distributed odd $\ell$-bit number, then $ax \bmod 2^{\ell}$ is a uniformly distributed odd $\ell$-bit number. To get $h^0_a(x)$, we divide by $2^{\ell}$, and then we have a uniform distribution on odd multiples of $1/2^{\ell}$. Now (5) is immediate because we have exactly one odd multiple in each interval $[i/2^{\ell-1}, (i+1)/2^{\ell-1}]$. Also, we maximize $\Pr_{a \leftarrow \mathcal{O}}[\|h^0_a(x)\| \le \varepsilon]/\varepsilon$ when we capture the points closest to 0 with $\varepsilon = 1/2^{\ell}$, that is, with $\Pr_{a \leftarrow \mathcal{O}}[\|h^0_a(x)\| \le 1/2^{\ell}] = 4/2^{\ell}$. Hence (4) follows. □
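Both statements of Lemma 8 can be confirmed by exhaustive enumeration for a small word size. The sketch below (hypothetical function name, 8-bit word chosen only to keep the enumeration tiny) counts the odd multipliers whose scaled product lands within circular distance $\varepsilon$ of 0.

```python
from fractions import Fraction

def prob_norm_at_most(x, eps, ell):
    """Pr over uniform odd ell-bit a that (a*x mod 2^ell)/2^ell is within eps of 0."""
    size = 1 << ell
    hits = 0
    for a in range(1, size, 2):            # all odd ell-bit multipliers
        r = (a * x) % size
        if Fraction(min(r, size - r), size) <= eps:
            hits += 1
    return Fraction(hits, size // 2)

ell = 8
print(prob_norm_at_most(3, Fraction(1, 2 ** ell), ell))        # extreme case of the upper bound
print(prob_norm_at_most(3, Fraction(5, 2 ** (ell - 1)), ell))  # a multiple of 1/2^(ell-1)
```

The first probability equals $4\varepsilon$ for $\varepsilon = 1/2^{\ell}$, and the second equals $2\varepsilon$, in line with the two cases of the lemma.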

We are now ready to prove our lower bound for the performance of linear probing with the basic multiply-shift scheme.

Theorem 9. Suppose $\ell_{out} < \ell$ and that the multiplier $a$ is a uniformly distributed odd $\ell$-bit number. If we use $h_a$ to insert $[n]$ in a linear probing table, then the expected average insertion cost is $\Omega(\log n)$.

Proof. By assumption $1/m = 1/2^{\ell_{out}}$ is a multiple of $1/2^{\ell-1}$, so for odd $x < n$, (5) implies

$\Pr_{a \leftarrow \mathcal{O}}[\|h^0_a(x)\| \le 1/m] = 2/m$. (6)

By Lemma 7 combined with (4) and (6), we get that

$\Pr_{a \leftarrow \mathcal{O}}[x = \mu_a] \ge \Pr[\|h^0_a(x)\| \le 1/(2m)] - \sum_{p \text{ prime factor of } x} \Pr[\|h^0_a(x/p)\| \le 1/(2pm)]$

$\ge 1/m - \sum_{p \text{ prime factor of } x} 2/(pm)$

From (1) we get that the expected average insertion cost is:

$\Omega\left(\sum_{x=1}^{n-1} n/x \cdot \Pr_{a \leftarrow \mathcal{O}}[x = \mu_a]\right) = \Omega\left(\sum_{\text{odd } x=1}^{n-1} \frac{n}{xm}\left(1 - 2\sum_{\text{prime factor } p \text{ of } x} 1/p\right)\right)$

$= \Omega\left(\sum_{\text{odd } x=1}^{n-1} \frac{n}{xm}\left(1 - 2\sum_{\text{prime } p=3,5,..} 1/p^2\right)\right)$

$= \Omega\left(\sum_{\text{odd } x=1}^{n-1} 0.594\,n/(xm)\right)$

$= \Omega(n/m \lg n)$.

Above we again moved terms of the form $-n/(xmp)$ where $p$ is a prime factor of $x$ to $x' = x/p$ in the form $-n/(x'mp^2)$. Since $x$ is odd, we only have to consider odd prime factors $p$, and then we used that $\sum_{\text{prime } p=3,5,..} 1/p^2 < 0.203$. This completes the proof of Theorem 9. □

We note that the plain universal hashing from [7] also assumes an odd multiplier, so Theorem 9 applies directly if $\ell_{out} < \ell$. The condition $\ell_{out} < \ell$ is, in fact, necessary for bad performance. If $\ell_{out} = \ell$, then $h_a$ is a permutation for any odd $a$, and then linear probing works perfectly.

For the 2-universal hashing [6] there are two differences. One is that the multiplier may also be even, but restricting it to be odd can only double the cost. The other difference is that we add an additional $\ell$-bit parameter $b$, yielding a scheme of the form:

$h_{a,b}(x) = \lfloor ((ax + b) \bmod 2^{\ell})/2^{\ell - \ell_{out}} \rfloor$.

The only effect of $b$ is a cyclic shift of the double full buckets, and this has no effect on the linear probing cost. For the 2-independent hashing, we have $\ell \ge \ell_{in} + \ell_{out} - 1$, so $\ell_{out} < \ell$ if $\ell_{in} > 1$. Hence again we have an expected average linear probing cost of $\Omega(n/m \lg n)$.

Finally, we sketch some variations of our bad input. So far, we have only considered the set $[n]$ of input keys, but it makes no essential difference if instead, for some integer constants $\alpha$ and $\beta$, we consider the arithmetic sequence $\alpha[n] + \beta = \{i\alpha + \beta \mid i \in [n]\}$. The $\beta$ just adds a cyclic shift like the $b$ in 2-independent hashing. If $\alpha$ is odd, then it is absorbed in the random multiplier $a$. What we get now is that if for some $x \in [n]$, we have $\|h^0_a(\alpha x)\| \le 1/(2m)$, then again we get an average cost $\Omega(n/x)$. A consequence is that no odd multiplier $a$ is universally safe because there always exists an inverse $\alpha$ (with $a\alpha \bmod 2^{\ell} = 1$) leading to a linear cost if $h_a$ is used to insert $\alpha[n] + \beta$. It is not hard to also construct bad examples for even $\alpha$. If $\alpha$ is an odd multiple of $2^i$, we just have to strengthen the condition $\ell_{out} < \ell$ to $\ell_{out} < \ell - i$ to get the expected average insertion cost of $\Omega(n/m \lg n)$. This kind of arithmetic sequence could be a true practical problem. For example, in some denial-of-service attacks, one often just changes some bits in the middle of a header key, and this gives an arithmetic sequence.

Another more practical concern is if the input set $X$ is an $\varepsilon$-fraction of $[n]$. As long as $\varepsilon > 2/3$, the above proof works almost unchanged. For smaller $\varepsilon$, our bad case is if $\|h^0_a(x)\| \le \varepsilon/(2m)$. In that case, for each $k \in [x]$, the $q = \lfloor n/x \rfloor$ potential keys $y$ from $[n]$ with $y \bmod x = k$ would map to an interval of length $\varepsilon(q-1)/(2m)$. This means that $h_a$ spreads these potential keys on at most $\lceil \varepsilon q/2 \rceil + 1$ consecutive array locations. An $\varepsilon$-fraction of these keys are real, so on the average, these intervals become double full, leading to an average cost of $\Omega(\varepsilon n/x)$. Strengthening $\ell_{out} < \ell$ to $\varepsilon > 2^{\ell_{out} - \ell}$, we essentially get that all probabilities are reduced by $\varepsilon$. Thus we end with a cost of $\Omega(\varepsilon^2 n/m \lg n) = \Omega(\varepsilon|X|/m \lg n)$.

5 Multiplication-Shift and Minwise Independence

We now consider the lack of minwise independence with a hashing scheme

$h_{a,b}(x) = (ax + b) \bmod 2^{\ell}$.

Shifting out less significant bits does not make much sense since we are not hashing to entries in an array. The analysis will be very similar to the one done for linear probing in Section 4 and we will only sketch it. Now the added $b$ is necessary to get anything meaningful, for without it, zero would always get the minimal hash value $h_{a,0}(0) = 0$. The effect of adding $b$ is to randomly spin the wheel from Figure 1. As in Section 4, it is convenient to divide by $2^{\ell}$ to get fractions in the cyclic unit interval. We thus define $h^0_{a,b}(x) = h_{a,b}(x)/2^{\ell}$.

The bad case will be the interval $[n]$ versus a random query key $q$. We assume that $n$ is a power of two. To see the parallels to Section 4, think of $m = n$. For any $a$, we define $\mu_a > 0$ to be the smallest value such that $\|h^0_{a,0}(\mu_a)\| \le 1/(2n)$. Then $h^0_{a,0}([n])$ falls in $\mu_a$ equidistant intervals, each of length at most $1/(2\mu_a)$. This leaves us with $\mu_a$ equidistant empty intervals, each of length at least $1/(2\mu_a)$. Together these empty intervals cover half the cyclic unit interval. When we add $b$, it has the effect of placing 0 randomly on the cycle. Having chosen $a$ and $b$, a random $q$ is also hashed to a random place on the cycle. The probability that $q$ and 0 end in the same empty interval, with $q$ after 0 in the interval, is $1/(2^3\mu_a)$. In this case, $q$ has the minimal hash value, but that should only have happened with probability $1/n$. Thus, relatively speaking, the probability is too high by a factor $\Omega(n/\mu_a)$, matching the linear probing cost from Section 4. As a result we will also end up concluding that the expected min-probability is too high by a factor $\Omega(\lg n)$.

As in Section 4, there are some details to consider. It is convenient to restrict ourselves to odd values of $a$, $b$, $x = \mu_a$ and $q$. As a result, all our hash values are odd, and at odd multiples of $1/2^{\ell}$ in the cyclic unit interval. The analysis goes through as long as $\ell \ge \lceil \log_2 n \rceil$, and in fact, we can shift out all but the $\lceil \log_2 u \rceil$ most significant bits and yet have the same lower bound that the min-probability is too high by a factor $\Omega(\lg n)$.